feat: batch video generation endpoint #399

Draft

ryanontheinside wants to merge 16 commits into main from ryanontheinside/feat/generate-endpoint

@ryanontheinside ryanontheinside commented Feb 4, 2026

Summary

Adds /api/v1/batch — an HTTP batch video generation API with SSE progress streaming. Processes video chunk-by-chunk with file-based binary transfer for inputs and outputs.

Primary consumer is the ComfyUI custom nodes (comfyui-scope), but the endpoint is client-agnostic.

Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/v1/batch` | POST | Generate video (SSE stream) |
| `/api/v1/batch/cancel` | POST | Cancel after current chunk |
| `/api/v1/batch/upload` | POST | Upload input video for v2v |
| `/api/v1/batch/upload-data` | POST | Upload binary data blob (VACE frames/masks, per-chunk video) |
| `/api/v1/batch/download` | GET | Download output video |

Key design decisions

Why not use the existing WebRTC streaming path?

Maximum quality and accuracy are the objective. WebRTC compression, frame dropping, and streaming-specific infrastructure are not only irrelevant but contrary to that goal. Batch generation needs lossless frame transfer and deterministic chunk processing.

Why a single binary blob with chunk spec offsets?

The ChunkSpec + blob approach allows a single generation request to dynamically mix t2v, v2v, VACE conditioning, and VACE masking on a per-chunk basis. A client packs all binary data (VACE frames/masks, per-chunk input video) into one blob upload, then references regions by byte offset in each ChunkSpec. This avoids multiple upload round-trips and keeps the JSON request clean.
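As a rough sketch of what a client does on its side of this protocol (the `pack_blob` helper is hypothetical, not part of this PR), binary arrays are concatenated into one contiguous blob and each region is addressed by byte offset plus shape, matching the `vace_frames_offset` / `vace_frames_shape` fields shown in the ChunkSpec example below:

```python
import numpy as np

def pack_blob(arrays):
    """Pack named float32 arrays into one contiguous blob.

    Returns the blob bytes plus a {name: (offset, shape)} map whose
    entries a client would copy into each ChunkSpec (e.g. as
    vace_frames_offset / vace_frames_shape).
    """
    parts, offsets, cursor = [], {}, 0
    for name, arr in arrays.items():
        raw = np.ascontiguousarray(arr, dtype=np.float32).tobytes()
        offsets[name] = (cursor, list(arr.shape))
        parts.append(raw)
        cursor += len(raw)
    return b"".join(parts), offsets

# Example: VACE conditioning frames for two chunks packed into one blob.
frames_c5 = np.zeros((1, 3, 12, 320, 576), dtype=np.float32)
frames_c7 = np.ones((1, 3, 12, 320, 576), dtype=np.float32)
blob, spec = pack_blob({"chunk5": frames_c5, "chunk7": frames_c7})
```

The client then uploads `blob` once via `/api/v1/batch/upload-data` and references `spec` offsets from individual ChunkSpecs, avoiding one round-trip per conditioning input.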

Why SSE instead of WebSocket?

Progress reporting is unidirectional (server → client). SSE is plain HTTP — no connection upgrade, works with any HTTP client, trivial to parse. WebSocket would add complexity for no benefit.

Why server-side temp files instead of in-memory?

Video tensors can be gigabytes. Streaming output to disk chunk-by-chunk keeps memory bounded to one chunk at a time rather than accumulating the full video in memory.
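A minimal sketch of the bounded-memory idea (the `write_chunks` helper and file layout are illustrative, not the actual `generate.py` implementation): each chunk is appended to a temp file as soon as it is produced, so peak memory is one chunk regardless of total video length.

```python
import os
import tempfile

import numpy as np

def write_chunks(chunk_iter):
    """Append each generated chunk to a temp file as raw float32 bytes,
    keeping peak memory at one chunk regardless of total video length."""
    fd, path = tempfile.mkstemp(prefix="generate_output_", suffix=".raw")
    total_frames = 0
    with os.fdopen(fd, "wb") as f:
        for chunk in chunk_iter:  # chunk: (frames, H, W, C) float32 array
            f.write(np.ascontiguousarray(chunk, dtype=np.float32).tobytes())
            total_frames += chunk.shape[0]
    return path, total_frames

# 8 chunks of 12 frames each -> 96 frames on disk, one chunk in RAM at a time.
chunks = (np.zeros((12, 32, 32, 3), dtype=np.float32) for _ in range(8))
path, n = write_chunks(chunks)
```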

Why module-level threading primitives?

Pipeline inference is synchronous and GPU-bound. The generation runs in a sync generator consumed by FastAPI's StreamingResponse. Async locks would require restructuring the pipeline call path for no gain. Single-client constraint (one generation at a time, 409 if busy) is enforced via threading.Lock.
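The single-client guard reduces to a non-blocking `threading.Lock` acquire; a minimal stdlib sketch (the function names are illustrative, and the actual wiring in `app.py` translates a failed acquire into an HTTP 409):

```python
import threading

# Module-level lock: at most one generation runs at a time.
_generation_lock = threading.Lock()

def try_start_generation():
    """Return True if this caller won the lock; a concurrent caller gets
    False, which the endpoint would map to HTTP 409."""
    return _generation_lock.acquire(blocking=False)

def finish_generation():
    _generation_lock.release()

assert try_start_generation() is True   # first request proceeds
assert try_start_generation() is False  # concurrent request -> 409
finish_generation()
assert try_start_generation() is True   # free again after completion
finish_generation()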

Changes

| File | What |
|------|------|
| `generate.py` (new) | Core generation engine: input decoding, ChunkSpec → pipeline kwargs, SSE streaming, processor chaining, temp file lifecycle |
| `schema.py` | `GenerateRequest`, `ChunkSpec`, `GenerateResponse`, upload/download response schemas |
| `app.py` | Five new endpoint handlers + concurrent generation guard + blob size limit |
| `pipeline_processor.py` | `batch_mode` for queue-based chunk processing (reuses existing processor chaining) |
| `recording.py` | Temp file prefixes for generate input/output/data |
| `docs/api/generate.md` | Binary protocol documentation |
| `scripts/test_generate_endpoint.py` | Manual test script covering LoRA ramp, v2v, VACE conditioning, inpainting |

Per-chunk control via ChunkSpec

Every generation parameter can be overridden per-chunk. Only fields that change need to be specified; prompts are sticky (the last-set value persists). Example:

```json
{
  "pipeline_id": "longlive",
  "prompt": "a cat walking",
  "num_frames": 96,
  "seed": 42,
  "chunk_specs": [
    {"chunk": 0, "lora_scales": {"/path/to/lora.safetensors": 0.0}},
    {"chunk": 3, "text": "a cat jumping", "lora_scales": {"/path/to/lora.safetensors": 0.5}},
    {"chunk": 5, "vace_frames_offset": 0, "vace_frames_shape": [1, 3, 12, 320, 576]}
  ],
  "data_blob_path": "<from /batch/upload-data>"
}
```

Limitations / follow-ups

- Chunk size is determined server-side; clients cannot query it before generating (separate PR planned to expose this)
- Single-instance only (upload paths are local filesystem references)

Test plan

- `uv run daydream-scope` starts without errors
- Text-to-video generation
- Video-to-video generation
- VACE depth/structure conditioning
- VACE inpainting with masks
- LoRA scale ramping across chunks
- Per-chunk prompt keyframing
- Cancellation mid-generation
- Concurrent generation rejection (409)
- ComfyUI ScopeSampler node integration
- Tested with LongLive pipeline
- Tested with StreamDiffusionV2 pipeline

# Add batch video generation endpoint with SSE streaming

## Summary

Adds `/api/v1/generate` endpoint for batch video generation with server-side chunking and SSE progress streaming. Supports text-to-video, video-to-video, VACE conditioning, and comprehensive per-chunk parameter scheduling.

This is important for the ComfyUI node wrapper for Scope. It could also conceivably replace test.py/test_vace.py, or at least their boilerplate code.

## Changes

- **`schema.py`**: Add `GenerateRequest`/`GenerateResponse` models with `EncodedArray` for binary data
- **`generate.py`**: New module handling chunked generation with SSE progress events
- **`app.py`**: Wire up the endpoint
- **`test_generate_endpoint.py`**: Integration tests for v2v, depth, inpainting, LoRA ramps
- **ComfyUI nodes**: Update `ScopeSampler` to use new schema

## Features

### Generation modes
- **Text-to-video**: Generate from prompt alone
- **Video-to-video**: Transform input video with configurable noise scale

### VACE conditioning
- **Reference images**: Style/identity conditioning via image paths
- **Depth/structure guidance**: Pass conditioning frames for structural control
- **Inpainting**: Binary masks specify regions to regenerate vs preserve

### Per-chunk parameter scheduling

All scheduling parameters accept either a single value (applied to all chunks) or a list (applied per-chunk, last value repeats if list is shorter than chunk count).

| Parameter | Type | Description |
|-----------|------|-------------|
| `seed` | `int \| list[int]` | Random seed per chunk |
| `noise_scale` | `float \| list[float]` | V2V noise injection strength |
| `vace_context_scale` | `float \| list[float]` | VACE conditioning influence |
| `lora_scales` | `dict[str, float \| list[float]]` | Per-LoRA strength scheduling |
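The scalar-or-list rule above can be sketched in a few lines (the `resolve_schedule` helper is illustrative, not the actual server code):

```python
def resolve_schedule(value, num_chunks):
    """Expand a scalar-or-list parameter to one value per chunk.

    A scalar applies to every chunk; a list applies per-chunk, with the
    last value repeating if the list is shorter than the chunk count.
    """
    if not isinstance(value, list):
        return [value] * num_chunks
    return [value[min(i, len(value) - 1)] for i in range(num_chunks)]

resolve_schedule(0.6, 4)         # [0.6, 0.6, 0.6, 0.6]
resolve_schedule([0.0, 0.5], 4)  # [0.0, 0.5, 0.5, 0.5]
```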

### Sparse keyframe updates

These parameters use a chunk-indexed specification, only sending updates when values change (sticky behavior).

| Parameter | Type | Description |
|-----------|------|-------------|
| `chunk_prompts` | `list[{chunk, text}]` | Prompt changes at specific chunks |
| `first_frames` | `list[{chunk, image}]` | First frame anchors for extension mode |
| `last_frames` | `list[{chunk, image}]` | Last frame anchors for extension mode |
| `vace_ref_images` | `list[{chunk, images}]` | Reference images at specific chunks |
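Sticky resolution for these sparse updates can be sketched as follows, shown here for `chunk_prompts` (the `resolve_keyframes` helper is illustrative, not the actual server code):

```python
def resolve_keyframes(updates, num_chunks, default=None):
    """Expand sparse {chunk, text} updates into one value per chunk.

    The last-set value persists until the next update (sticky behavior).
    """
    by_chunk = {u["chunk"]: u["text"] for u in updates}
    out, current = [], default
    for i in range(num_chunks):
        current = by_chunk.get(i, current)
        out.append(current)
    return out

resolve_keyframes(
    [{"chunk": 3, "text": "a cat jumping"}],
    num_chunks=6,
    default="a cat sitting calmly",
)
# ['a cat sitting calmly'] * 3 + ['a cat jumping'] * 3
```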

## Design decisions

Some features were left out of this PR for simplicity (e.g., prompt spatial/temporal blending). They can be added in a follow-up.

### SSE streaming

Clients such as test scripts and ComfyUI nodes need performance metrics and progress updates. SSE provides per-chunk progress updates without requiring WebSocket infrastructure:

```
event: progress
data: {"chunk": 1, "total_chunks": 8, "fps": 4.2, "latency": 2.85}

event: progress
data: {"chunk": 2, "total_chunks": 8, "fps": 4.5, "latency": 2.67}

event: complete
data: {"video_base64": "...", "video_shape": [96, 320, 576, 3], ...}
```
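Parsing this stream on the client side needs nothing beyond line splitting and JSON decoding; a minimal sketch (the `parse_sse` helper is illustrative, and a real client would feed it lines from a streaming HTTP response):

```python
import json

def parse_sse(lines):
    """Parse SSE lines into (event, data) pairs; data payloads are JSON."""
    event = None
    for raw in lines:
        if raw.startswith("event:"):
            event = raw.split(":", 1)[1].strip()
        elif raw.startswith("data:"):
            yield event, json.loads(raw.split(":", 1)[1])

sample = [
    "event: progress",
    'data: {"chunk": 1, "total_chunks": 8}',
    "",
    "event: complete",
    'data: {"video_shape": [96, 320, 576, 3]}',
]
events = list(parse_sse(sample))
# [('progress', {'chunk': 1, 'total_chunks': 8}), ('complete', ...)]
```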

### Server-side chunking

The server determines chunk size from the pipeline, handles frame padding, and manages KV cache initialization. Callers specify total frames and per-chunk parameters; the server handles the rest.
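The chunk-count and padding arithmetic amounts to a ceiling division; a sketch under the assumption that the pipeline reports a fixed chunk size (the `plan_chunks` helper is illustrative, not the actual server code):

```python
import math

def plan_chunks(num_frames, chunk_size):
    """Return (number of chunks, frames of padding on the last chunk)
    for a pipeline-specific chunk_size."""
    num_chunks = math.ceil(num_frames / chunk_size)
    padding = num_chunks * chunk_size - num_frames
    return num_chunks, padding

plan_chunks(96, 12)   # (8, 0) -- 96 frames divide evenly into 8 chunks
plan_chunks(100, 12)  # (9, 8) -- last chunk padded with 8 frames
```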

## Example usage

### LoRA strength ramp (dissolve effect)

```python
request = GenerateRequest(
    pipeline_id="longlive",
    prompt="a woman dissolving into particles",
    num_frames=96,  # 8 chunks × 12 frames
    lora_scales={
        "path/to/dissolve.safetensors": [0.0, 0.15, 0.3, 0.5, 0.7, 0.85, 1.0, 1.0]
    },
)
```

### Video-to-video with prompt changes

```python
request = GenerateRequest(
    pipeline_id="longlive",
    prompt="a cat sitting calmly",
    chunk_prompts=[
        {"chunk": 3, "text": "a cat jumping"},
        {"chunk": 6, "text": "a cat landing gracefully"},
    ],
    input_video=EncodedArray(base64="...", shape=[96, 512, 512, 3]),
    noise_scale=0.6,
)
```

### Depth-guided generation

```python
request = GenerateRequest(
    pipeline_id="longlive",
    prompt="a robot walking through a forest",
    vace_frames=EncodedArray(base64="...", shape=[1, 3, 48, 320, 576]),
    vace_context_scale=1.5,
)
```

## Test plan

- [x] `uv run daydream-scope` starts without errors
- [x] V2V generation produces correct output
- [x] VACE depth conditioning works
- [x] VACE inpainting with masks works
- [x] LoRA scale ramping works across chunks
- [x] Per-chunk noise scale scheduling works
- [x] Prompt keyframing updates at correct chunks
- [x] ComfyUI ScopeSampler node works (WIP)
- [x] Tested with LongLive
- [x] Same test with StreamDiffusionV2

@ryanontheinside ryanontheinside changed the title feat: generate endpoint with SSE streaming feat: batch video generation endpoint (/api/v1/generate) Feb 20, 2026
@ryanontheinside ryanontheinside changed the title feat: batch video generation endpoint (/api/v1/generate) feat: batch video generation endpoint Feb 20, 2026